Abstract


Using the complete genome sequences of 19 coronaviruses, we analyzed the codon usage bias, dinucleotide relative abundance and cytosine
deamination in coronavirus genomes. Of the eight codons that contain CpG, six were markedly suppressed. The mean NNU/NNC ratio of the six amino
acids using either NNC or NNU as codon is 3.262, suggesting cytosine deamination. Among the 16 dinucleotides, CpG was most markedly suppressed
(mean relative abundance 0.509). No correlation was observed between CpG abundance and mean NNU/NNC ratio. Among the 19 coronaviruses,
CoV-HKU1 showed the most extreme codon usage bias and an extremely high NNU/NNC ratio of 8.835. Cytosine deamination and selection of CpG-suppressed
clones by the immune system are the two major independent biochemical and biological selective forces that shape codon usage bias in coronavirus
genomes. The underlying mechanism for the extreme codon usage bias, cytosine deamination and G+C content in CoV-HKU1 warrants further studies.


© 2007 Elsevier Inc. All rights reserved. 
Introduction


Codon usage bias is one of the most important indicators of 
the selective forces that shape genome evolution. In general, 
codon usage bias may be a result of mutation pressure and/or 
relative abundance of the corresponding acceptor tRNA 
molecules. For human RNA viruses, it has been observed in 
one study that codon usage bias was related to mutation pressure, 
G+C content, segmented nature of the genome and the route of 
transmission of the virus (Jenkins and Holmes, 2003). In other 
studies, it has been suggested that mutation pressure may result 
in bias in dinucleotide usage, such as CpG suppression, in small 
eukaryotic viruses (Karlin et al., 1994; Shackelton et al., 2006). 
Other factors, such as cytosine deamination, which results in
C→U changes, have also been proposed to be responsible for
shaping the G+C contents and GC skews of RNA viruses (Pyrc
et al., 2004). Recently, it has been observed that codon usage is
an important driving force in the evolution of astroviruses and 
small DNA viruses (Sewatanon et al., 2007; van Hemert et al., 
2007). Despite all these fragmented observations, no study has 
integrated the various factors and been able to explain the basis 
for codon usage bias in viruses successfully. 

Coronaviruses are positive sense, single-stranded RNA 
(ssRNA) viruses found in a wide range of animals in which 
they can cause respiratory, enteric, hepatic and neurological 
diseases of varying severity. The sizes of the genomes of coronaviruses
are about 30 kb, the largest among RNA viruses. Based on
genotypic and serological characterization, coronaviruses were 
divided into three distinct groups (Brian and Baric, 2005; Lai and 
Cavanagh, 1997; Ziebuhr, 2004). As a result of the low fidelity of 
the RNA-dependent-RNA polymerases, the mutation rates of 
RNA virus genomes are high, in the order of 1 per 10,000
nucleotides replicated. Furthermore, the unique mechanism of
viral replication has resulted in a high frequency of recombination
in coronaviruses (Lai and Cavanagh, 1997; Woo et al.,
2006b). Their tendency for recombination and high mutation 
rates have made their genomes highly plastic, allowed them to 
adapt to new hosts and ecological niches, and given them the 
potential to be good candidates for causing pandemics. These 
factors have made the study of coronavirus evolution particularly 
important, both biologically and for practical purposes (Grigoriev,
2004; Gu et al., 2004; Yap et al., 2003). However, the relative
importance of the various selective forces that shape the codon 
usage bias in coronaviruses and their underlying biological and 
biochemical basis are still poorly understood. 

The recent severe acute respiratory syndrome (SARS) 
epidemic, the discovery of SARS coronavirus (SARS-CoV) 
and identification of SARS-CoV-like viruses from Himalayan 
palm civets and a raccoon dog from wild live markets in China
have led to a boost in interest in the discovery of novel coronaviruses
in both humans and animals (Guan et al., 2003; Marra et al., 2003; 
Peiris et al., 2003; Rota et al., 2003; Snijder et al., 2003; Woo et 
al., 2004). For human coronaviruses, in 2004, a novel group 1 
human coronavirus, human coronavirus NL63 (HCoV-NL63), 
was reported (Fouchier et al., 2004; van der Hoek et al., 2004); 
and in 2005, we described the discovery, complete genome 
sequence and molecular diversity of another novel group 2 human 
coronavirus, coronavirus HKU1 (CoV-HKU1) (Lau et al., 2006; 
Woo et al., 2005a,b,c, 2006b). As for animal coronaviruses, six 
group 1 (Poon et al., 2005; Tang et al., 2006; Woo et al., 2006a;
Lau et al., 2007), six group 2, including bat SARS coronavirus, 
sable antelope coronavirus, giraffe coronavirus, and two new 
subgroups of group 2 coronaviruses (Lau et al., 2005; Li et al., 
2005; Woo et al., 2006a, 2007), 11 group 3 (Cavanagh et al., 
2002; East et al., 2004; Jonassen et al., 2005; Liu et al., 2005; 
Hasoksuz et al., 2007) coronaviruses, and two unclassified 
coronaviruses from Asian leopard cats and Chinese ferret badgers 
(Dong et al., 2007) have recently been described. Since the 
number of coronavirus species with complete genomes available 
has increased from 9 in 2003 to 19 in 2007, this has provided a 
golden opportunity to study genome evolution in coronaviruses. 

In this study, we analyzed the codon usage bias, dinucleotide
relative abundance and cytosine deamination in coronavirus
genomes, as well as the codon usage bias in the hosts of the various
coronaviruses. The relative importance of the various forces in
shaping the codon usage bias in the various coronaviruses and
the extreme codon usage bias and cytosine deamination in
CoV-HKU1 are also discussed.


Results 
Codon usage in coronavirus genomes 


The mean (S.D.) effective number of codons (Nc) of the 19 
coronaviruses is 45.448 (4.207) (Table 1). The codon usage 
fractions in the 19 coronavirus genomes are shown in Table 2. 
For all amino acids, the codon usage patterns of every individual 
coronavirus species are similar to the general codon usage 
patterns in coronaviruses. CoV-HKU1, HCoV-NL63, murine
hepatitis virus (MHV) and bat coronavirus HKU5 (bat-CoV
HKU5) are the four coronaviruses with a relatively larger number
of codons showing usage fractions outside the mean ± 2 S.D.
usage fraction range of the corresponding codons, probably due
to their relatively high (MHV and bat-CoV HKU5) or low
(CoV-HKU1 and HCoV-NL63) G+C contents (Tables 1 and 2).
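
The mean ± 2 S.D. criterion used above can be illustrated with a minimal Python sketch. The three codons and their values are transcribed from Table 2 (mean, S.D. and the CoV-HKU1 usage fraction); the helper name `outside_two_sd` and the choice of codons are illustrative assumptions, not part of the original analysis.

```python
# Mean and S.D. of the usage fraction across the 19 genomes, and the
# CoV-HKU1 value, for three example codons (transcribed from Table 2).
TABLE2_SUBSET = {
    # codon: (mean, sd, CoV-HKU1 usage fraction)
    "UUA": (0.223, 0.066, 0.415),
    "AAU": (0.724, 0.083, 0.876),
    "CAU": (0.732, 0.068, 0.895),
}


def outside_two_sd(value, mean, sd):
    """Return True if value lies outside the interval mean +/- 2 S.D."""
    return abs(value - mean) > 2 * sd


# Codons whose CoV-HKU1 usage fraction falls outside the mean +/- 2 S.D. range.
flagged = [
    codon
    for codon, (mean, sd, value) in TABLE2_SUBSET.items()
    if outside_two_sd(value, mean, sd)
]
print(flagged)  # ['UUA', 'CAU']
```

In this subset, UUA and CAU fall outside the range for CoV-HKU1, whereas AAU, although high, stays within it.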

To study the possible effect of CpG suppression on codon 
usage bias, the usage fractions of the eight codons that contain 
CpG (CCG, GCG, UCG, ACG, CGC, CGG, CGU and CGA) 
were analyzed. Of these eight codons, six [CCG (mean 0.058), 
GCG (mean 0.060), UCG (mean 0.038), ACG (mean 0.070), 
CGG (mean 0.038) and CGA (mean 0.060)] were markedly 
suppressed. CGC was slightly suppressed (mean 0.122) whereas
CGU was over-represented (mean 0.322).
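
As a rough illustration of what a codon usage fraction is, the following Python sketch counts codons in a coding sequence and divides each count by the total for its amino acid. It assumes the standard genetic code and an in-frame sequence; the function name `codon_usage_fractions` and the toy sequence are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order ('*' marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# The eight CpG-containing codons listed in the text.
CPG_CODONS = {"CCG", "GCG", "UCG", "ACG", "CGC", "CGG", "CGU", "CGA"}


def codon_usage_fractions(cds):
    """Fraction of each sense codon among all codons for the same amino acid."""
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


# Toy in-frame sequence (hypothetical): Arg, Arg, Arg, Arg, Lys, Lys.
fractions = codon_usage_fractions("CGUCGUAGACGGAAAAAG")
for codon in sorted(fractions):
    tag = " (CpG)" if codon in CPG_CODONS else ""
    print(codon, round(fractions[codon], 3), tag)
```

For the toy sequence, the arginine codons CGU, AGA and CGG receive fractions of 0.5, 0.25 and 0.25, so the fractions within each amino acid sum to 1, as in Table 2.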

To study the possible effect of cytosine deamination on codon 
usage bias, codons of amino acids that can use C or U in the 
codons were analyzed. For all amino acids that only use either 
NNU or NNC as codon (asparagine, histidine, aspartic acid, 
tyrosine, cysteine and phenylalanine), all NNU are markedly over-represented,
with mean usage fractions of more than 0.700, whereas the
mean usage fractions of all NNC are less than 0.300. For amino acids
that use NNU, NNC or other codons (threonine, isoleucine, 
proline, leucine, alanine, glycine, valine and serine), the usage 
fractions of all NNU are at least three times more than those of the 
corresponding NNC. For leucine, UUA (mean 0.223) is used 
much more frequently than CUA (mean 0.081), and UUG (mean 
0.261) is used much more frequently than CUG (mean 0.072). 

To study the possible effect of A↔G transition on codon
usage bias, codons of amino acids that can use A or G in the 
codons were analyzed. For amino acids that use either NNA or 
NNG as codons (lysine, glutamine and glutamic acid) and those 
that use NNA, NNG or other codons but excluding those codons 
with CpG (arginine, glycine and valine), the usage fractions of 
NNA are often higher than those of NNG, but the differences 
between the usage fractions of NNA and NNG are not as marked 
as those between the usage fractions of NNU and NNC. 


Codon usage in CoV-HKU1


Among all the 19 coronaviruses, CoV-HKU1 showed the
most extreme codon usage bias. CoV-HKU1 is the only
coronavirus that showed Nc outside the mean ± 2 S.D. range.
CoV-HKU1 also possessed the lowest G+C content, highest GC
skew, lowest percentages of G and C and highest percentage of U
among all coronavirus genomes (Table 1). For the six amino
acids that only use either NNU or NNC as codon (asparagine,
histidine, aspartic acid, tyrosine, cysteine and phenylalanine),
amino acids that use NNU, NNC or other codons (threonine,
isoleucine, proline, leucine, alanine, glycine, valine and serine),
and for leucine that uses UNN or CNN as codon, the average
(S.D.) ratio of the usage fractions of the codons with U to those with
C is 9.66 (2.49) (Table 2). For amino acids that use either NNA
or NNG as codons (lysine, glutamine and glutamic acid) and
those that use NNA, NNG or other codons but excluding those
codons with CpG (arginine, glycine and valine), the average
(S.D.) ratio of the usage fractions of the codons with A to those with
G is 2.72 (0.57) (Table 2).
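
The A-to-G ratio quoted above can be recomputed from the CoV-HKU1 column of Table 2. The Python sketch below assumes that the six codon pairs are AAA/AAG, CAA/CAG, GAA/GAG, AGA/AGG, GGA/GGG and GUA/GUG, and that the S.D. is the sample standard deviation; this pairing is an inference from the text, although it does reproduce the reported 2.72 (0.57).

```python
from statistics import mean, stdev

# CoV-HKU1 usage fractions transcribed from Table 2: (NNA codon, NNG codon).
# The choice of these six pairs is an assumption based on the text.
HKU1_A_G_PAIRS = {
    "AAA/AAG (lysine)": (0.695, 0.305),
    "CAA/CAG (glutamine)": (0.690, 0.310),
    "GAA/GAG (glutamic acid)": (0.684, 0.316),
    "AGA/AGG (arginine)": (0.337, 0.098),
    "GGA/GGG (glycine)": (0.104, 0.035),
    "GUA/GUG (valine)": (0.196, 0.060),
}

# Ratio of the A-ending codon to the G-ending codon for each pair.
ratios = [a / g for a, g in HKU1_A_G_PAIRS.values()]
summary = f"{mean(ratios):.2f} ({stdev(ratios):.2f})"
print(summary)  # 2.72 (0.57)
```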
Table 1
Coronavirus genomes used in the present study

Coronavirus          Host      GenBank accession no.   Reference
Group 1a
TGEV                 Pig       NC_002306               Almazan et al., 2000
FIPV                 Cat       AY994055                Haijema et al., 2003
PRCV                 Pig       DQ811787                Zhang et al., 2007
Group 1b
HCoV-229E            Human     NC_002645               Thiel et al., 2001
HCoV-NL63            Human     NC_005831               van der Hoek et al., 2004
PEDV                 Pig       NC_003436               Kocherhans et al., 2001
BtCoV                Bat       DQ648858                Tang et al., 2006
Bat-CoV HKU2         Bat       EF203064                Lau et al., 2007
Group 2a
HCoV-OC43            Human     NC_005147               Vijgen et al., 2005
CoV-HKU1             Human     NC_006577               Woo et al., 2005b
BCoV                 Cattle    NC_003045               Chouljenko et al., 2001
PHEV                 Pig       NC_007732               Vijgen et al., 2006
MHV                  Mouse     NC_001846               Leparc-Goffart et al., 1997
Group 2b
SARS-CoV             Human     NC_004718               Marra et al., 2003
Bat-SARS-CoV HKU3    Bat       DQ022305                Lau et al., 2005
Group 2c
Bat-CoV HKU4         Bat       EF065506                Woo et al., 2006b
Bat-CoV HKU5         Bat       EF065511                Woo et al., 2006b
Group 2d
Bat-CoV HKU9         Bat       EF065513                Woo et al., 2006b
Group 3
IBV                  Chicken   NC_001451               Boursnell et al., 1987


Codon usage in hosts of coronaviruses 


The codon usage fractions in the hosts of coronaviruses, 
including human, mouse, pig, cat and chicken, are shown in 
Table 3. To study the possible effect of CpG suppression on 
codon usage bias, the usage fractions of the eight codons that 
contain CpG (CCG, GCG, UCG, ACG, CGC, CGG, CGU and 
CGA) were analyzed. Among these eight codons, six (CCG, 
GCG, UCG, ACG, CGU and CGA) were suppressed, of which 
five were also suppressed in the coronavirus genomes. To study 
the possible effect of C↔U transition and A↔G transition on
codon usage bias, codons of amino acids that can use C or U and 
those of amino acids that can use A or G in the codons were 
analyzed. No pattern of difference was observed between the 
use of NNU and NNC and between the use of NNA and NNG. 


Dinucleotide relative abundance in coronavirus genomes 


The relative abundances of the 16 dinucleotides in the 19
coronavirus genomes are shown in Table 4. Among the 16
dinucleotides, the relative abundance of CpG showed the most
marked deviation from the “normal range” (mean ± S.D. =
0.509 ± 0.063, 0.271 less than 0.78), with all 19 genomes
showing CpG under-representation. In addition, the relative
abundance of UpG and CpA also showed slight deviation from
the “normal range” (mean ± S.D. = 1.331 ± 0.057 and 1.257 ±
0.070, respectively, both >1.23), with all 19 and 13 genomes
showing UpG and CpA over-representation, respectively.

27,317 38.2 0.129 216 272 34.6 16:7 44.281
28,033 42.0 0.086 22.8 24.7 Sue 19.2 48.424
28,203 40.1 0.102 22:1) 26.2 cae 18.0 46.905
30,738 36.8 0.176 2127 21.6 35:6 15.2 43.791
30,480 aie 0.164 PAE ica 35.4 15.6 44.380
31,357 41.7 0.142 2 26.0 32.3 179 1237
29,751 40.7 0.020 20.8 28:5 30.7 20.0 49.423
29,728 41.1 0.027 21.4 28.4 30.5 20.0 49.882
30,286 37.8 0.093 2057 27.6 34.6 r7.1 44.585
30,488 42.9 0.004 21.6 26.6 30.4 21.4 53.230
29,114 41.0 0.138 2a3 25.3 33.7 7 46.162
27,608 319 0.144 27 28.9 33,2 16.2 45.777
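
For readers unfamiliar with the measure, the sketch below shows one common way to compute dinucleotide relative abundance on a single-stranded sequence: the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies. This odds-ratio definition is an assumption here, since the methods are not part of this section; only the 0.78 and 1.23 limits of the “normal range” come from the text. The function names and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_relative_abundance(seq):
    """Observed/expected ratio f(XY) / (f(X) * f(Y)) for all 16 dinucleotides."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    # Overlapping dinucleotides: positions (0,1), (1,2), ..., (n-2, n-1).
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    rho = {}
    for x, y in product("ACGU", repeat=2):
        expected = base_freq.get(x, 0.0) * base_freq.get(y, 0.0)
        observed = pair_counts[x + y] / (n - 1)
        rho[x + y] = observed / expected if expected > 0 else float("nan")
    return rho


def classify(rho_value, low=0.78, high=1.23):
    """Label a relative abundance against the limits quoted in the text."""
    if rho_value < low:
        return "under-represented"
    if rho_value > high:
        return "over-represented"
    return "normal range"


rho = dinucleotide_relative_abundance("ACGUACGU")  # toy sequence, not real data
print(round(rho["CG"], 3), classify(rho["CG"]))
print(classify(0.509), classify(1.331))  # mean CpG and UpG values from the text
```

With these limits, the mean CpG value of 0.509 is classified as under-represented and the mean UpG value of 1.331 as over-represented, matching the description above.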


Correlations between CpG suppression and cytosine 
deamination in coronaviruses 


The relationship between CpG suppression and cytosine 
deamination in the 19 coronavirus genomes is shown in Fig. 1. 
The mean (S.D.) of the NNU/NNC in the six amino acids that 
only use either NNC or NNU as the codons of the 19 coronavirus 
genomes is 3.262 (1.785). CoV-HKU1 showed an extremely high
NNU/NNC ratio of 8.835. No significant correlation was
observed between CpG abundance and mean NNU/NNC ratio
in the 19 coronavirus genomes (r = -0.339, P = 0.156).
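
The figures in this section can be recomputed from Tables 2 and 4. The Python sketch below assumes that the NNU/NNC ratio of a genome is the arithmetic mean of the six per-amino-acid ratios, that the S.D. is the sample standard deviation, and that r is the Pearson coefficient; these are inferences from the reported numbers rather than a statement of the original method. The NNU fractions (AAU, CAU, GAU, UAU, UGU, UUU) are transcribed from Table 2 and the CpG relative abundances from Table 4.

```python
from math import sqrt
from statistics import mean, stdev

# genome: ((AAU, CAU, GAU, UAU, UGU, UUU usage fractions), CpG relative abundance)
DATA = {
    "TGEV": ((0.653, 0.750, 0.661, 0.609, 0.684, 0.708), 0.456),
    "FIPV": ((0.662, 0.713, 0.672, 0.597, 0.729, 0.704), 0.501),
    "PRCV": ((0.654, 0.757, 0.676, 0.624, 0.710, 0.724), 0.462),
    "HCoV-229E": ((0.689, 0.716, 0.632, 0.674, 0.718, 0.794), 0.499),
    "HCoV-NL63": ((0.792, 0.794, 0.782, 0.810, 0.909, 0.895), 0.416),
    "PEDV": ((0.641, 0.680, 0.643, 0.665, 0.674, 0.671), 0.548),
    "BtCoV": ((0.632, 0.611, 0.631, 0.611, 0.741, 0.762), 0.502),
    "Bat-CoV HKU2": ((0.743, 0.734, 0.640, 0.725, 0.733, 0.842), 0.509),
    "HCoV-OC43": ((0.828, 0.787, 0.822, 0.832, 0.754, 0.874), 0.485),
    "CoV-HKU1": ((0.876, 0.895, 0.850, 0.882, 0.918, 0.929), 0.445),
    "BCoV": ((0.834, 0.788, 0.816, 0.801, 0.775, 0.874), 0.479),
    "PHEV": ((0.821, 0.779, 0.821, 0.796, 0.754, 0.871), 0.502),
    "MHV": ((0.743, 0.726, 0.743, 0.774, 0.660, 0.759), 0.607),
    "SARS-CoV": ((0.628, 0.641, 0.628, 0.561, 0.625, 0.623), 0.456),
    "Bat-SARS-CoV HKU3": ((0.634, 0.673, 0.580, 0.572, 0.656, 0.596), 0.497),
    "Bat-CoV HKU4": ((0.763, 0.779, 0.738, 0.778, 0.803, 0.799), 0.508),
    "Bat-CoV HKU5": ((0.624, 0.639, 0.629, 0.564, 0.627, 0.571), 0.605),
    "Bat-CoV HKU9": ((0.783, 0.767, 0.704, 0.745, 0.730, 0.823), 0.678),
    "IBV": ((0.762, 0.680, 0.701, 0.734, 0.820, 0.787), 0.512),
}


def mean_nnu_nnc(nnu_fractions):
    """Mean NNU/NNC ratio; for two-codon amino acids NNC = 1 - NNU."""
    return mean([u / (1.0 - u) for u in nnu_fractions])


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


ratios = {name: mean_nnu_nnc(nnu) for name, (nnu, _) in DATA.items()}
cpg = [abundance for _, abundance in DATA.values()]
overall_mean = mean(list(ratios.values()))
overall_sd = stdev(list(ratios.values()))
r = pearson_r(cpg, list(ratios.values()))

print(f"CoV-HKU1 NNU/NNC: {ratios['CoV-HKU1']:.3f}")  # 8.835
# Reported in the text: 3.262 (1.785) and r = -0.339.
print(f"mean (S.D.): {overall_mean:.3f} ({overall_sd:.3f})")
print(f"Pearson r: {r:.3f}")
```

Because the table values are rounded to three decimals, the recomputed mean, S.D. and r may differ from the reported figures in the last digit. The P value is not recomputed here.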


Discussion 


Marked CpG suppression is observed in all coronavirus
genomes. The discovery of Toll-like receptors (TLRs) that
recognize pathogen-associated molecular patterns and the
downstream molecular pathways was one of the biggest
advances in the understanding of vertebrate innate immunity
in recent years. Among the TLRs that recognize viral components,
TLR3, 7, 8 and 9 detect viral nucleic acids (Bowie and
Haga, 2005). It has been shown that TLR9 binds to CpG of
double-stranded DNA and elicits the downstream inflammatory
response, and administration of CpG oligodeoxynucleotides
has been shown to protect mice from herpes simplex virus 2
infections (Ashkar et al., 2003; Lund et al., 2003). Furthermore,
it has been shown that CpG is under-represented in the genomes
of small DNA viruses, which could be related to their evasion of
the host immune systems (Karlin et al., 1994; Shackelton et al.,
2006). Although CpG suppression was also observed in RNA
viruses, no known TLR has been shown to recognize CpG of
ssRNA. However, recently it has been shown that ssRNA can
stimulate human CD14+CD11c+ monocytes to produce large
amounts of interleukin 12, but this activation of monocytes by
CpG oligoribonucleotides was not mediated through TLR3, 7, 8
or 9 (Sugiyama et al., 2005). The results suggested that CpG
oligoribonucleotides may stimulate monocytes through a novel
mechanism distinct from previously known immunostimulatory
nucleic acids. In the present study, we showed that the mean CpG
relative abundance in the coronavirus genomes is markedly
suppressed (Table 4). This concurs with the results observed in a
study on di- and trinucleotide frequencies in nine coronaviruses
10 years ago (Tobler and Ackermann, 1998). The most logical
way to avoid CpG is to mutate it to either UpG or CpA. This
is in line with the observation that these two dinucleotides are
over-represented in the coronavirus genomes, but their deviations
from the upper limit of the “normal range” are not as
remarkable as that of CpG from the lower limit of the “normal
range”, as the CpG suppression pressure is equally shared by
UpG and CpA over-representation. Interestingly, only CpG-containing
codons in the context of purine-CpG (ACG and
GCG), pyrimidine-CpG (UCG and CCG) and CpG-purine
(CGA and CGG) are suppressed (Table 2), whereas CpG-pyrimidine
(CGU and CGC) are not. However, when trinucleotide
frequencies were analyzed in the 19 coronavirus genomes,
all the eight trinucleotides with CpG were suppressed (Fig. 2).
This indicates that there is probably another force that has led to
an increased use of CGU and CGC as codons for arginine, but this
force does not act on trinucleotides over the whole genome in
general. This force is probably unrelated to the relative
abundance of the corresponding tRNA molecules in the hosts
of the coronaviruses, as the pattern of bias in the hosts is not the
same as that in the coronaviruses.


Table 2
Codon usage fractions in coronaviruses

Amino acid  Codon(a)  HCoV-229E  HCoV-NL63  TGEV  PEDV  PRCV  FIPV  BtCoV  Bat-CoV HKU2  HCoV-OC43  CoV-HKU1  BCoV  MHV  PHEV  SARS-CoV  Bat-SARS-CoV HKU3  Bat-CoV HKU4  Bat-CoV HKU5  Bat-CoV HKU9  IBV  Mean ± S.D.
Lysine  AAA  0.561  0.590  0.610  0.347  0.610  0.584  0.464  0.440  0.511  0.695  0.509  0.392  0.513  0.532  0.541  0.557  0.474  0.394  0.518  0.518 ± 0.086
Lysine  AAG  0.439  0.410  0.390  0.653  0.390  0.416  0.536  0.560  0.489  0.305  0.491  0.608  0.487  0.468  0.459  0.443  0.526  0.606  0.482  0.482 ± 0.086
Asparagine  AAC  0.311  0.208  0.347  0.359  0.346  0.338  0.368  0.257  0.172  0.124  0.166  0.257  0.179  0.372  0.366  0.237  0.376  0.217  0.238  0.276 ± 0.083
Asparagine  AAU  0.689  0.792  0.653  0.641  0.654  0.662  0.632  0.743  0.828  0.876  0.834  0.743  0.821  0.628  0.634  0.763  0.624  0.783  0.762  0.724 ± 0.083
Threonine  ACA  0.356  0.290  0.409  0.290  0.398  0.424  0.316  0.293  0.318  0.261  0.300  0.261  0.292  0.391  0.380  0.305  0.271  0.273  0.381  0.327 ± 0.054
Threonine  ACC  0.115  0.075  0.103  0.172  0.101  0.104  0.177  0.128  0.134  0.057  0.140  0.218  0.144  0.132  0.121  0.125  0.173  0.144  0.080  0.129 ± 0.039
Threonine  ACG  0.045  0.032  0.064  0.081  0.076  0.086  0.077  0.045  0.055  0.018  0.062  0.134  0.059  0.047  0.073  0.065  0.100  0.126  0.076  0.070 ± 0.029
Threonine  ACU  0.484  0.603  0.424  0.457  0.425  0.385  0.431  0.534  0.492  0.664  0.499  0.387  0.504  0.430  0.426  0.505  0.457  0.458  0.463  0.475 ± 0.069
Arginine  AGA  0.365  0.236  0.464  0.188  0.477  0.457  0.287  0.214  0.305  0.337  0.299  0.249  0.321  0.352  0.383  0.266  0.276  0.188  0.347  0.316 ± 0.087
Arginine  AGG  0.109  0.141  0.145  0.152  0.143  0.136  0.169  0.193  0.117  0.098  0.126  0.184  0.117  0.157  0.111  0.129  0.135  0.195  0.155  0.143 ± 0.028
Arginine  CGA  0.049  0.047  0.035  0.058  0.037  0.065  0.066  0.052  0.076  0.056  0.070  0.075  0.058  0.080  0.060  0.071  0.079  0.036  0.061  0.060 ± 0.014
Arginine  CGC  0.115  0.084  0.082  0.195  0.077  0.077  0.097  0.125  0.120  0.075  0.114  0.171  0.143  0.128  0.147  0.129  0.173  0.137  0.134  0.122 ± 0.035
Arginine  CGG  0.030  0.017  0.035  0.036  0.030  0.022  0.030  0.043  0.050  0.036  0.053  0.057  0.047  0.019  0.030  0.041  0.064  0.041  0.032  0.038 ± 0.013
Arginine  CGU  0.332  0.475  0.240  0.371  0.237  0.244  0.350  0.373  0.331  0.399  0.337  0.262  0.315  0.264  0.269  0.364  0.273  0.404  0.271  0.322 ± 0.066
Isoleucine  AUA  0.252  0.241  0.230  0.187  0.239  0.280  0.259  0.221  0.323  0.296  0.320  0.304  0.326  0.219  0.229  0.315  0.229  0.392  0.389  0.276 ± 0.058
Isoleucine  AUC  0.130  0.062  0.163  0.212  0.148  0.167  0.181  0.133  0.098  0.056  0.098  0.152  0.102  0.213  0.237  0.152  0.294  0.099  0.088  0.147 ± 0.062
Isoleucine  AUU  0.618  0.697  0.607  0.601  0.613  0.554  0.560  0.646  0.579  0.649  0.582  0.544  0.572  0.568  0.534  0.533  0.477  0.508  0.522  0.577 ± 0.054
Glutamine  CAA  0.652  0.656  0.638  0.436  0.620  0.613  0.551  0.403  0.541  0.690  0.531  0.464  0.517  0.604  0.568  0.602  0.550  0.440  0.623  0.563 ± 0.082
Glutamine  CAG  0.348  0.344  0.362  0.564  0.380  0.387  0.449  0.597  0.459  0.310  0.469  0.536  0.483  0.396  0.432  0.398  0.450  0.560  0.377  0.437 ± 0.082
Histidine  CAC  0.284  0.206  0.250  0.320  0.243  0.287  0.389  0.266  0.213  0.105  0.212  0.274  0.221  0.359  0.327  0.221  0.361  0.233  0.320  0.268 ± 0.068
Histidine  CAU  0.716  0.794  0.750  0.680  0.757  0.713  0.611  0.734  0.787  0.895  0.788  0.726  0.779  0.641  0.673  0.779  0.639  0.767  0.680  0.732 ± 0.068
Proline  CCA  0.318  0.303  0.411  0.327  0.414  0.364  0.338  0.298  0.306  0.239  0.300  0.285  0.288  0.426  0.394  0.301  0.306  0.261  0.386  0.330 ± 0.054
Proline  CCC  0.108  0.052  0.061  0.141  0.064  0.093  0.122  0.079  0.124  0.073  0.133  0.221  0.108  0.093  0.122  0.081  0.156  0.118  0.099  0.108 ± 0.040
Proline  CCG  0.054  0.029  0.049  0.049  0.037  0.050  0.059  0.066  0.058  0.031  0.062  0.100  0.064  0.043  0.051  0.039  0.069  0.109  0.086  0.058 ± 0.021
Proline  CCU  0.520  0.616  0.479  0.483  0.485  0.492  0.481  0.556  0.512  0.657  0.504  0.395  0.541  0.439  0.433  0.578  0.469  0.512  0.429  0.504 ± 0.065
Leucine  CUA  0.069  0.041  0.093  0.080  0.084  0.108  0.061  0.058  0.065  0.036  0.068  0.086  0.066  0.117  0.126  0.072  0.115  0.086  0.109  0.081 ± 0.025
Leucine  CUC  0.053  0.037  0.088  0.106  0.087  0.093  0.073  0.060  0.049  0.021  0.049  0.085  0.053  0.136  0.135  0.069  0.153  0.060  0.060  0.077 ± 0.035
Leucine  CUU  0.283  0.311  0.366  0.316  0.361  0.314  0.268  0.392  0.248  0.227  0.241  0.209  0.252  0.299  0.301  0.265  0.290  0.194  0.298  0.286 ± 0.052
Leucine  CUG  0.069  0.022  0.048  0.100  0.044  0.063  0.085  0.059  0.073  0.023  0.069  0.110  0.065  0.094  0.097  0.053  0.130  0.082  0.076  0.072 ± 0.028
Leucine  UUA  0.184  0.281  0.204  0.129  0.212  0.198  0.153  0.182  0.248  0.415  0.247  0.225  0.247  0.177  0.182  0.308  0.148  0.271  0.235  0.223 ± 0.066
Leucine  UUG  0.342  0.308  0.202  0.270  0.212  0.224  0.360  0.249  0.316  0.278  0.325  0.285  0.316  0.178  0.159  0.233  0.164  0.306  0.223  0.261 ± 0.061
Glutamic acid  GAA  0.676  0.662  0.740  0.462  0.732  0.684  0.599  0.467  0.591  0.684  0.577  0.449  0.627  0.531  0.552  0.627  0.567  0.496  0.604  0.596 ± 0.089
Glutamic acid  GAG  0.324  0.338  0.260  0.538  0.268  0.316  0.401  0.533  0.409  0.316  0.423  0.551  0.373  0.469  0.448  0.373  0.433  0.504  0.396  0.404 ± 0.089
Aspartic acid  GAC  0.368  0.218  0.339  0.357  0.324  0.328  0.369  0.360  0.178  0.150  0.184  0.257  0.179  0.372  0.420  0.262  0.371  0.296  0.299  0.296 ± 0.081
Aspartic acid  GAU  0.632  0.782  0.661  0.643  0.676  0.672  0.631  0.640  0.822  0.850  0.816  0.743  0.821  0.628  0.580  0.738  0.629  0.704  0.701  0.704 ± 0.081
Alanine  GCA  0.277  0.266  0.334  0.282  0.327  0.318  0.284  0.236  0.274  0.231  0.280  0.226  0.276  0.274  0.284  0.293  0.318  0.257  0.373  0.285 ± 0.037
Alanine  GCC  0.127  0.114  0.099  0.170  0.127  0.129  0.191  0.130  0.138  0.089  0.140  0.217  0.148  0.145  0.156  0.128  0.170  0.147  0.089  0.140 ± 0.032
Alanine  GCG  0.058  0.025  0.034  0.075  0.032  0.046  0.047  0.064  0.050  0.030  0.050  0.113  0.043  0.059  0.067  0.056  0.082  0.132  0.080  0.060 ± 0.028
Alanine  GCU  0.538  0.595  0.533  0.473  0.513  0.507  0.478  0.569  0.538  0.650  0.530  0.444  0.533  0.522  0.493  0.523  0.430  0.463  0.457  0.515 ± 0.054
Glycine  GGA  0.115  0.073  0.179  0.104  0.185  0.166  0.096  0.084  0.171  0.104  0.159  0.177  0.156  0.228  0.233  0.131  0.205  0.074  0.198  0.149 ± 0.051
Glycine  GGC  0.199  0.076  0.143  0.252  0.136  0.147  0.195  0.172  0.159  0.107  0.152  0.268  0.184  0.252  0.226  0.171  0.272  0.183  0.152  0.181 ± 0.054
Glycine  GGG  0.030  0.023  0.032  0.051  0.035  0.028  0.051  0.044  0.060  0.035  0.069  0.086  0.076  0.041  0.059  0.065  0.067  0.072  0.043  0.051 ± 0.018
Glycine  GGU  0.655  0.828  0.645  0.594  0.644  0.658  0.657  0.700  0.610  0.755  0.620  0.470  0.584  0.479  0.483  0.633  0.456  0.671  0.607  0.618 ± 0.096
Valine  GUA  0.120  0.106  0.194  0.121  0.177  0.216  0.122  0.097  0.174  0.196  0.163  0.136  0.169  0.215  0.199  0.226  0.158  0.210  0.224  0.170 ± 0.042
Valine  GUC  0.119  0.090  0.167  0.182  0.160  0.145  0.151  0.125  0.082  0.056  0.084  0.140  0.089  0.171  0.175  0.103  0.244  0.107  0.101  0.131 ± 0.046
Valine  GUG  0.190  0.078  0.162  0.193  0.167  0.181  0.187  0.146  0.191  0.060  0.197  0.285  0.202  0.189  0.196  0.157  0.197  0.221  0.172  0.177 ± 0.048
Valine  GUU  0.571  0.727  0.477  0.504  0.496  0.458  0.539  0.632  0.553  0.688  0.556  0.440  0.540  0.424  0.431  0.514  0.401  0.463  0.502  0.522 ± 0.087
Tyrosine  UAC  0.326  0.190  0.391  0.335  0.376  0.403  0.389  0.275  0.168  0.118  0.199  0.226  0.204  0.439  0.428  0.222  0.436  0.255  0.266  0.297 ± 0.102
Tyrosine  UAU  0.674  0.810  0.609  0.665  0.624  0.597  0.611  0.725  0.832  0.882  0.801  0.774  0.796  0.561  0.572  0.778  0.564  0.745  0.734  0.703 ± 0.102
Serine  UCA  0.187  0.192  0.217  [illegible]  0.214  0.204  0.213  0.166  0.146  0.144  0.143  0.141  0.155  0.294  0.275  0.214  0.200  0.162  0.207  0.192 ± 0.042
Serine  UCC  0.082  0.043  0.073  0.122  0.070  0.094  0.101  0.082  0.074  0.038  0.071  0.105  0.088  0.066  0.066  0.082  0.134  0.085  0.035  0.080 ± 0.026
Serine  UCG  0.028  0.010  0.020  0.043  0.021  0.035  0.037  0.034  0.031  0.012  0.032  0.063  0.032  0.040  0.046  0.033  0.077  0.071  0.048  0.038 ± 0.018
Serine  UCU  0.353  0.393  0.312  0.313  0.319  0.276  0.317  0.340  0.302  0.454  0.327  0.239  0.290  0.322  0.313  0.331  0.274  0.305  0.344  0.322 ± 0.046
Serine  AGC  0.087  0.044  0.111  0.113  0.108  0.100  0.070  0.101  0.096  0.031  0.100  0.148  0.113  0.087  0.091  0.077  0.113  0.074  0.064  0.091 ± 0.027
Serine  AGU  0.263  0.318  0.267  0.235  0.268  0.291  0.262  0.277  0.351  0.322  0.328  0.304  0.322  0.190  0.209  0.263  0.202  0.304  0.301  0.278 ± 0.045
Cysteine  UGC  0.282  0.091  0.316  0.326  0.290  0.271  0.259  0.267  0.246  0.082  0.225  0.340  0.246  0.375  0.344  0.197  0.373  0.270  0.180  0.262 ± 0.082
Cysteine  UGU  0.718  0.909  0.684  0.674  0.710  0.729  0.741  0.733  0.754  0.918  0.775  0.660  0.754  0.625  0.656  0.803  0.627  0.730  0.820  0.738 ± 0.082
Phenylalanine  UUC  0.206  0.105  0.292  0.329  0.276  0.296  0.238  0.158  0.126  0.071  0.126  0.241  0.129  0.377  0.404  0.201  0.429  0.177  0.213  0.231 ± 0.104
Phenylalanine  UUU  0.794  0.895  0.708  0.671  0.724  0.704  0.762  0.842  0.874  0.929  0.874  0.759  0.871  0.623  0.596  0.799  0.571  0.823  0.787  0.769 ± 0.104

(a) Codons with CpG are in red and codons of amino acids that use either NNC or NNU as the codon are in green. (For interpretation of the references to colour in this table legend, the reader is referred to the web version of this article.)


Table 3
Codon usage fractions in different hosts of coronaviruses

Amino acid      Codon(a)  Human (Homo sapiens)  Mouse (Mus musculus)  Pig (Sus scrofa)  Cat (Felis catus)  Chicken (Gallus gallus)
Lysine          AAA       0.43                  0.39                  0.38              0.42               0.44
Lysine          AAG       0.57                  0.61                  0.62              0.58               0.56
Asparagine      AAC       0.53                  0.57                  0.61              0.59               0.57
Asparagine      AAU       0.47                  0.43                  0.39              0.41               0.43
Threonine       ACA       0.28                  0.29                  0.23              0.24               0.30
Threonine       ACC       0.36                  0.35                  0.42              0.40               0.31
Threonine       ACG       0.11                  0.11                  0.14              0.15               0.14
Threonine       ACU       0.25                  0.25                  0.21              0.21               0.25
Arginine        AGA       0.21                  0.21                  0.19              0.22               0.23
Arginine        AGG       0.21                  0.22                  0.20              0.23               0.21
Arginine        CGA       0.11                  0.12                  0.10              0.09               0.10
Arginine        CGC       0.18                  0.17                  0.22              0.19               0.19
Arginine        CGG       0.20                  0.19                  0.21              0.20               0.18
Arginine        CGU       0.08                  0.09                  0.08              0.07               0.10
Isoleucine      AUA       0.17                  0.16                  0.14              0.15               0.18
Isoleucine      AUC       0.47                  0.50                  0.56              [illegible]        0.46
Isoleucine      AUU       0.36                  0.34                  0.30              0.31               0.35
Glutamine       CAA       0.26                  0.25                  0.22              0.27               0.27
Glutamine       CAG       0.74                  0.75                  0.78              0.73               0.73
Histidine       CAC       0.58                  0.60                  0.65              0.63               0.60
Histidine       CAU       0.42                  0.40                  0.35              0.37               0.40
Proline         CCA       0.28                  0.29                  0.24              0.24               0.28
Proline         CCC       0.32                  0.30                  0.36              0.37               0.30
Proline         CCG       0.11                  0.10                  0.14              0.13               0.14
Proline         CCU       0.29                  0.31                  0.26              0.26               0.28
Leucine         CUA       0.07                  0.08                  0.06              0.06               0.06
Leucine         CUC       0.20                  0.20                  0.22              0.21               0.18
Leucine         CUU       0.13                  0.13                  0.11              0.11               0.13
Leucine         CUG       0.40                  0.39                  0.45              0.44               0.41
Leucine         UUA       0.08                  0.06                  0.05              0.05               0.08
Leucine         UUG       0.13                  0.13                  0.11              0.13               0.14
Glutamic acid   GAA       0.42                  0.40                  0.37              0.42               0.43
Glutamic acid   GAG       0.58                  0.60                  0.63              0.58               0.57
Aspartic acid   GAC       0.54                  0.56                  0.60              0.58               0.49
Aspartic acid   GAU       0.46                  0.44                  0.40              0.42               0.51
Alanine         GCA       0.23                  0.23                  0.18              0.19               0.27
Alanine         GCC       0.40                  0.38                  0.45              0.45               0.32
Alanine         GCG       0.11                  0.09                  0.12              0.13               0.12
Alanine         GCU       0.27                  0.29                  0.24              0.24               0.29
Glycine         GGA       0.25                  0.26                  0.23              0.25               0.27
Glycine         GGC       0.34                  0.33                  0.37              0.35               0.31
Glycine         GGG       0.25                  0.24                  0.26              0.24               0.24
Glycine         GGU       0.16                  0.18                  0.14              0.15               0.18
Valine          GUA       0.12                  0.12                  0.09              0.10               0.13
Valine          GUC       0.24                  0.25                  0.27              0.28               0.22
Valine          GUG       0.46                  0.46                  0.50              0.47               0.45
Valine          GUU       0.18                  0.17                  0.50              0.15               0.21
Tyrosine        UAC       0.56                  0.57                  0.64              0.61               0.60
Tyrosine        UAU       0.44                  0.43                  0.36              0.39               0.40
Serine          UCA       0.15                  0.14                  0.12              0.12               0.15
Serine          UCC       0.22                  0.22                  0.25              0.25               0.20
Serine          UCG       0.05                  0.05                  0.06              0.06               0.07
Serine          UCU       0.19                  0.20                  0.17              0.19               0.18
Serine          AGC       0.24                  0.24                  0.27              0.24               0.26
Serine          AGU       0.15                  0.15                  0.13              0.13               0.14
Cysteine        UGC       0.55                  0.52                  0.61              0.57               0.60
Cysteine        UGU       0.45                  0.48                  0.39              0.43               0.40
Phenylalanine   UUC       0.54                  0.56                  0.61              0.59               0.54
Phenylalanine   UUU       0.46                  0.44                  0.39              0.41               0.46

(a) Codons with CpG are in red and codons of amino acids that use either NNC or NNU as the codon are in green. (For interpretation of the references to colour in this table legend, the reader is referred to the web version of this article.)


Table 4
Relative abundance of the 16 dinucleotides in the 19 coronavirus species with complete genomes available

Relative abundance of the 16 dinucleotides(a)
Coronavirus          AA     AC     AG     AU     CA     CC     CG     CU           GA           GC     GG     GU     UA     UC     UG     UU
TGEV                 1.080  1.176  0.954  0.865  1.316  0.830  0.456  [illegible]  [illegible]  1.142  0.895  1.062  0.803  0.822  1.401  1.016
FIPV                 1.067  1.221  0.945  0.869  1.301  0.923  0.501  1.108        0.913        1.113  0.884  1.092  0.848  0.786  1.383  1.000
PRCV                 1.083  1.185  0.956  0.863  1.307  0.860  0.462  1.136        0.939        1.157  0.886  1.062  0.812  0.811  1.382  1.016
HCoV-229E            1.162  1.254  0.885  0.818  1.320  0.896  0.499  1.090        0.868        1.192  0.857  1.097  0.786  0.709  1.418  1.035
HCoV-NL63            1.171  1.293  0.874  0.834  1.293  1.012  0.416  1.098        0.817        1.111  0.924  1.122  0.872  0.744  1.339  1.008
PEDV                 1.065  1.201  0.976  0.865  1.349  0.922  0.548  1.098        0.852        1.165  0.923  1.070  0.853  0.784  1.321  1.007
BtCoV                1.092  1.251  0.949  0.815  1.378  0.987  0.502  1.038        0.846        1.080  0.921  1.127  0.826  0.741  1.342  1.047
Bat-CoV HKU2         1.048  1.243  0.940  0.881  1.266  0.921  0.509  1.136        0.886        1.233  0.872  1.039  0.903  0.695  1.360  0.990
HCoV-OC43            1.050  1.024  0.985  0.946  1.239  1.168  0.485  1.053        0.884        1.303  0.891  1.022  0.926  0.720  1.281  1.002
CoV-HKU1             1.086  1.106  0.965  0.923  1.106  1.183  0.445  1.131        0.908        1.174  0.914  1.063  0.959  0.805  1.246  0.976
BCoV                 1.052  1.049  0.987  0.945  1.216  1.196  0.479  1.067        0.887        1.289  0.904  1.020  0.935  0.699  1.292  0.999
PHEV                 1.059  1.033  0.995  0.951  1.244  1.150  0.502  1.050        0.877        1.329  0.891  1.015  0.931  0.706  1.301  0.997
MHV                  1.079  0.945  1.013  0.952  1.160  1.217  0.607  1.037        0.901        1.262  0.875  1.023  0.916  0.709  1.295  0.996
SARS-CoV             1.034  1.175  0.995  0.857  1.298  0.824  0.456  1.205        0.944        1.153  0.924  0.986  0.800  0.846  1.409  1.018
Bat-SARS-CoV HKU3    1.053  1.179  1.001  0.842  1.267  0.849  0.497  1.196        0.967        1.137  0.920  1.010  0.808  0.852  1.382  1.010
Bat-CoV HKU4         1.037  1.122  0.997  0.911  1.186  1.025  0.508  1.132        0.857        1.214  0.886  1.075  0.952  0.777  1.298  0.960
Bat-CoV HKU5         1.088  1.071  0.974  0.902  1.229  0.873  0.605  [illegible]  0.922        1.146  0.900  1.035  0.828  0.922  1.355  0.952
Bat-CoV HKU9         0.968  1.138  1.000  0.938  1.161  1.117  0.678  1.039        0.780        1.236  0.902  1.095  1.079  0.637  1.235  0.959
IBV                  1.053  1.132  1.068  0.833  1.238  0.990  0.512  1.115        0.877        1.194  0.913  1.082  0.906  0.762  1.249  1.034

(a) Numbers >1.23 and <0.78 are shown in red and green, respectively. (For interpretation of the references to colour in this table legend, the reader is referred to the web version of this article.)

In addition to CpG suppression, marked cytosine deamina- 
tion is also observed in all coronavirus genomes. Although 
deamination of cytosine has been recognized for a few decades 
as a significant source of spontaneous mutations (Duncan and 
Miller, 1980), DNA-cytosine deaminases, which are able to 
attack cytosines in single-stranded DNA, have only been 
discovered in recent years (Bransteitter et al., 2003; 
Sohail et al., 2003). The discovery of the ability to edit human 
immunodeficiency virus DNA, and subsequently RNA as well, 
by the human cytidine deaminase APOBEC3G has allowed the 
speculation that APOBEC-mediated cytosine deamination may 
contribute to the sequence variation of RNA viruses that 
replicate without any DNA intermediates (Bishop et al., 2004). 
GC skew, which reflects cytosine deamination, has been studied 
in various coronaviruses, and it has been shown that the GC 
skews of coronavirus genomes become less pronounced in the 
one third of the genome that encodes the structural proteins 
(Grigoriev, 2004; Pyrc et al., 2004). In the present study, using 
the six amino acids that are only encoded by NNU or NNC, 
hence excluding most other pressures that may affect the relative 
abundance of cytosine and uracil, we showed that all these NNU 
and NNC had usage fractions of >0.700 and <0.300, res- 
pectively (Table 2). In fact, for all codons that encode the same 
amino acid and with either C or U in any position, the usage 
fraction of the codon that uses U is invariably higher than the one 
that uses C in all coronaviruses. Furthermore, the percentage of 
C showed a strong inverse relationship with the percentage of U 
in coronavirus genomes (r = -0.902, P < 0.0001) (Fig. 3). All 
these observations suggest that cytosine deamination is an important 
biochemical force in shaping coronavirus evolution. 
Fig. 1. Correlation between CpG dinucleotide abundance and NNU/NNC ratio in the 19 coronavirus genomes. 


Cytosine deamination and selection of CpG suppressed 
clones by the immune system are the two major independent 
biochemical and biological selective forces that shape codon 
usage bias in coronavirus genomes. Codon usage bias in 
coronaviruses is unrelated to the relative abundance of the 
corresponding tRNA molecules, as the patterns of bias in 
codon usage fractions in the hosts are not the same as those in 
the coronaviruses (Tables 2 and 3). Although others have tried 
to explain variations in codon usage in coronaviruses by 
compositional constraints (Gu et al., 2004), we think that 
codon usage bias and nucleotide composition of the corona- 
virus genomes, which are apparently related to each other, are 
both results of other biological and biochemical selective 
forces, rather than nucleotide composition being a cause of codon 
usage bias. On the other hand, most of the codon usage bias in 
the coronaviruses can be easily explained by CpG suppression 
and cytosine deamination (Table 2). For asparagine, isoleucine, 
histidine, aspartic acid, glycine, valine, tyrosine, cysteine and 
phenylalanine, NNU are used more frequently than NNC 
because of cytosine deamination. For lysine, glutamine and 
glutamic acid, NNA are used slightly more frequently than 
NNG because of cytosine deamination in the minus strand 
during RNA replication. For threonine, ACG is suppressed 
because of CpG suppression and ACU is used more frequently 
than ACC because of cytosine deamination. For arginine, CGA 
and CGG are suppressed because of CpG suppression and 
CGU is used more frequently than CGC because of cytosine 
deamination. AGA is used more frequently than AGG and 
CGA is used more frequently than CGG because of cytosine 
deamination in the minus strand during RNA replication. For 
proline, CCG is suppressed because of CpG suppression and 
CCU is used more frequently than CCC because of cytosine 
deamination. For leucine, CUU is used more frequently than 
CUC, UUA is used more frequently than CUA, and UUG is 
used more frequently than CUG because of cytosine deamina- 
tion. For alanine, GCG is suppressed because of CpG 
suppression and GCU is used more frequently than GCC 
because of cytosine deamination. For serine, UCG is 


Fig. 2. Mean frequencies of 64 trinucleotides in the 19 coronavirus genomes (x-axis, trinucleotide; y-axis, mean frequency). The dots and the bars represent the mean frequencies and the 95% confidence intervals of the trinucleotides. The dotted line represents the frequency of each trinucleotide (1/64=0.015625) if the bases are distributed at random. The CpG-containing trinucleotides are in red. 

Fig. 3. Correlations among mononucleotide frequencies in the 19 coronavirus genomes. The symbols for the various coronaviruses are the same as those used in Fig. 1. 


suppressed because of CpG suppression and UCU is used more 
frequently than UCC while AGU is used more frequently than 
AGC because of cytosine deamination. In addition to showing 
that CpG suppression and cytosine deamination are probably 
the two most important biological/biochemical forces that 
shape codon usage bias, we also demonstrated that these two 
forces are independent (Fig. 1), although cytosine deamination 
and subsequent selection of CpG suppressed clones by the 
immune system may be one of the mechanisms that has led to 
the resultant CpG suppression. Furthermore, we speculate that 
the species-specific number of CpG containing codons may not 
simply be the result of mutation pressure to avoid CpG, but an 
equilibrium between the immune pressure and the required 
number of CpG containing codons to serve biological 
functions such as to maintain RNA structure stability. Such 
an additional factor could explain the mere correlation between 
the NNU/NNC ratio and CpG dinucleotide abundance. 

The underlying mechanism for the extreme codon usage 
bias, cytosine deamination and G+C content in CoV-HKU1 is 
enigmatic. The contribution of cytosine deamination to genome 
evolution varies from very low to very high among the 19 
coronavirus genomes. For bat-CoV HKU5, SARS-CoV and bat- 
SARS-CoV, the mean NNU/NNC ratios are less than 1.7 (Fig. 
1). Codon usage bias in these coronaviruses is relatively mild 
(Nc of 53.23, 49.423 and 49.882, respectively; Table 1), and is 
mainly due to CpG suppression (Table 2). On the other hand, for 
CoV-HKU1, the mean NNU/NNC ratio is more than 8.8 (Fig. 
1), which is likely a result of rapid cytosine deamination. 
Although the biochemical basis for this extreme cytosine 
deamination is not known, this is probably the explanation for 
the extremely strong codon usage bias in CoV-HKU1 (Nc of 
35.671) and its lowest G+C content of 32% among all 
coronavirus genomes (Table 1). 


Materials and methods 
Coronavirus and host genomes 


One genome sequence of each of the 19 coronavirus species 
with complete genome sequence available was downloaded 
from the GenBank database (Table 1). The genomes of the hosts 
of the coronaviruses, including those of human, mouse, pig, cat 
and chicken, were also downloaded. 


Codon usage 


Codon usage bias was calculated according to the method 
described by Wright (1990). Using this method, when only one 
codon is used for each amino acid, Nc for the virus would be 20, 
and when all codons are used equally, the Nc for the virus would 
be 61. The codon usage fraction of a particular codon in a 
genome is calculated as the ratio of the number of occurrences of 
that codon to the total number of occurrences of that codon and 
its synonymous codons (i.e., of the amino acid they encode) in 
the protein coding sequences of the genome. The method for 
calculating codon usage bias accounting for background 
nucleotide composition (Nc′) (Novembre, 2002) was not used 
because it had been proposed to suffer from methodological 
problems, although those problems did not affect the conclusions 
which had been drawn by using Nc of this study (Fuglsang, 2006). 
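
The two quantities defined above can be illustrated with a minimal Python sketch. It is not the authors' software (none is named in the text). The function names are hypothetical, the standard genetic code is assumed, and Nc is computed with the usual form of Wright's estimator, Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, where F is the mean codon homozygosity within each degeneracy class. Capping values above 61 is a common convention assumed here.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codons(cds):
    """Count in-frame sense codons of one coding sequence (DNA or RNA letters)."""
    seq = cds.upper().replace("T", "U")
    counts = Counter()
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if CODON_TABLE.get(codon, "*") != "*":  # skip stops and ambiguous codons
            counts[codon] += 1
    return counts


def usage_fractions(counts):
    """Codon count divided by the count of all codons for the same amino acid."""
    per_amino_acid = Counter()
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {c: n / per_amino_acid[CODON_TABLE[c]] for c, n in counts.items()}


def effective_number_of_codons(counts):
    """Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, with F averaged per degeneracy class."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    f_by_class = {}
    for codons in families.values():
        k = len(codons)
        n = sum(counts.get(c, 0) for c in codons)
        if k == 1 or n < 2:
            continue  # Met/Trp carry no choice; tiny samples give no estimate
        sum_p2 = sum((counts.get(c, 0) / n) ** 2 for c in codons)
        f_by_class.setdefault(k, []).append((n * sum_p2 - 1) / (n - 1))
    nc = 2.0
    for k, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_class.get(k)
        if not values or sum(values) <= 0:
            raise ValueError(f"no usable homozygosity estimate for class {k}")
        nc += n_amino_acids / (sum(values) / len(values))
    return min(nc, 61.0)  # values above 61 are conventionally capped
```

For a whole genome, the counts from each annotated coding sequence would be summed before calling `usage_fractions` or `effective_number_of_codons`. The sketch raises an error rather than guessing when a degeneracy class has no usable data.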


Dinucleotide relative abundance in coronavirus genomes 


The relative abundance of the dinucleotides in the corona- 
virus genomes was assessed using the method described by 
Karlin and Burge (1995). The odds ratio $\rho_{XY} = f_{XY}/(f_X f_Y)$, where $f_X$ 
denotes the frequency of the nucleotide X and $f_{XY}$ the frequency 
of the dinucleotide XY, etc., was calculated for each dinucleotide. 
From data simulations and statistical theory, 
$\rho_{XY} \le 0.78$ (extreme under-representation) or $\rho_{XY} \ge 1.23$ (extreme 
over-representation) occurs for sufficiently long ($\ge$ 20 kb) 
random sequences with probability at most 0.001 for 
virtually any base composition. 
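
As an illustration of the odds ratio described above, the following Python sketch computes the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies on a single strand, and applies the 0.78 and 1.23 cut-offs. The function names and the handling of ambiguous bases (pairs containing them are skipped) are assumptions for this sketch, not details taken from the paper.

```python
from collections import Counter

BASES = "ACGU"


def dinucleotide_odds_ratios(genome):
    """Return rho_XY = f_XY / (f_X * f_Y) for all 16 dinucleotides."""
    seq = genome.upper().replace("T", "U")
    mono = Counter(b for b in seq if b in BASES)
    # Overlapping dinucleotides; pairs that touch an ambiguous base are skipped.
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no unambiguous dinucleotide")
    rho = {}
    for x in BASES:
        for y in BASES:
            expected = (mono[x] / n_mono) * (mono[y] / n_mono)
            observed = di[x + y] / n_di
            rho[x + y] = observed / expected if expected > 0 else float("nan")
    return rho


def classify(rho_xy):
    """Apply the thresholds quoted in the text for sequences of at least 20 kb."""
    if rho_xy <= 0.78:
        return "under-represented"
    if rho_xy >= 1.23:
        return "over-represented"
    return "normal range"
```

The thresholds are only meaningful for sufficiently long sequences, so `classify` should not be applied to short fragments.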


Correlations between CpG suppression and cytosine 
deamination in coronaviruses 


To study possible correlations between CpG suppression and 
cytosine deamination in coronaviruses, the relative abundance 
of CpG and the mean ratio of NNU to NNC in the six amino 
acids (asparagine, histidine, aspartic acid, tyrosine, cysteine and 
phenylalanine) that only use either NNC or NNU as the codons 
(NNU/NNC ratio, representing contribution of cytosine deami- 
nation) were calculated for all 19 coronavirus genomes. 
Analysis of correlation between CpG relative abundance and NNU/ 
NNC ratio was performed using Pearson’s correlation (SPSS 
version 11.0). 
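
The calculation described above can be sketched in Python as follows. The authors used SPSS, so this is only a stand-in using the textbook Pearson formula. The sketch also assumes that the "mean ratio" is the arithmetic mean of the six per-amino-acid NNU/NNC ratios, which the text does not state explicitly. All names are hypothetical.

```python
import math

# The six amino acids encoded only by an NNU/NNC codon pair (standard code).
NNY_PAIRS = {
    "Asn": ("AAU", "AAC"),
    "His": ("CAU", "CAC"),
    "Asp": ("GAU", "GAC"),
    "Tyr": ("UAU", "UAC"),
    "Cys": ("UGU", "UGC"),
    "Phe": ("UUU", "UUC"),
}


def mean_nnu_nnc_ratio(codon_counts):
    """Mean of the six NNU/NNC count ratios (assumed averaging scheme)."""
    ratios = [codon_counts[u] / codon_counts[c] for u, c in NNY_PAIRS.values()]
    return sum(ratios) / len(ratios)


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

In use, one value of CpG relative abundance and one mean NNU/NNC ratio would be computed per genome, and the two lists of 19 values passed to `pearson_r`.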